S.R. Kulkarni  S.K. Mitter  J.N. Tsitsiklis  O. Zeitouni

July 16, 1991


Abstract 

In  this  paper,  we  introduce  an  extension  of  the  standard  PAC  model  which  allows  the  use 
of  generalized  samples.  We  view  a  generalized  sample  as  a  pair  consisting  of  a  functional  on 
the  concept  class  together  with  the  value  obtained  by  the  functional  operating  on  the  unknown 
concept.  It  appears  that  this  model  can  be  applied  to  a  number  of  problems  in  signal  processing 
and  geometric  reconstruction  to  provide  sample  size  bounds  under  a  PAC  criterion.  We  consider 
a  specific  application  of  the  generalized  model  to  a  problem  of  curve  reconstruction,  and  discuss 
some connections with a result from stochastic geometry.

1 Introduction


The  Probably  Approximately  Correct  (or  PAC)  learning  model  is  a  precise  framework  attempting 
to  formalize  the  notion  of  learning  from  examples.  The  earliest  work  on  PAC-like  models  was 
done  by  Vapnik  [21],  and  many  fundamental  results  relevant  to  the  PAC  model  have  been  obtained 
in  the  probability  and  statistics  literature  [19,  20,  5,  12].  Valiant  [18]  independently  proposed  a 
similar model which has resulted in a great deal of work on the PAC model in the machine learning
community.  More  recently,  Haussler  [6]  has  formulated  a  very  general  framework  refining  and 
consolidating  much  of  the  previous  work  on  the  PAC  model. 

In  the  usual  PAC  model,  the  information  received  by  the  learner  consists  of  random  samples 
of  some  unknown  function.  Here  we  introduce  an  extension  in  which  the  learner  may  receive 
information  from  much  more  general  types  of  samples,  which  we  refer  to  as  generalized  samples.  A 
generalized  sample  is  essentially  a  functional  assigning  a  real  number  to  each  concept,  where  the 
number assigned may not necessarily be the value of the unknown concept at a point, but could be
some  other  attribute  of  the  unknown  concept  (e.g.,  the  integral  over  a  region,  or  the  derivative  at 
a  given  point,  etc.).  The  model  is  defined  for  the  general  case  in  which  the  concepts  are  real  valued 
functions,  and  is  applicable  to  both  distribution-free  and  fixed  distribution  learnability.  The  idea 
is  simply  to  transform  learning  with  generalized  samples  to  a  problem  of  learning  with  standard 
samples over a new instance space and concept class. The PAC learning criterion over the original
space is induced by the corresponding standard PAC criterion over the transformed space. Thus,
the criteria for learnability and sample size bounds are the usual ones involving metric entropy and
a generalization of VC dimension for functions (in the fixed distribution and distribution-free cases
respectively).

We  consider  a  particular  example  of  learning  from  generalized  samples  that  is  related  to  a 
classical  result  from  stochastic  geometry.  Namely,  we  take  X  to  be  the  unit  square  in  the  plane, 
and  consider  concept  classes  which  are  collections  of  curves  contained  in  X.  For  example,  one 
simple concept class of interest is the set of straight line segments contained in X. A much more
general  concept  class  we  consider  is  the  set  of  curves  in  X  with  bounded  length  and  bounded 
turn  (total  absolute  curvature).  The  samples  observed  by  the  learner  consist  of  randomly  drawn 
straight  lines  labeled  as  to  the  number  of  intersections  the  random  line  makes  with  the  target 
concept  (i.e.,  the  unknown  curve).  We  consider  learnability  with  respect  to  a  fixed  distribution, 
where  the  distribution  is  the  uniform  distribution  on  the  set  of  lines  intersecting  X.  A  learnability 
result  is  obtained  by  providing  metric  entropy  bounds  for  the  class  of  curves  under  consideration. 

The example of learning a curve is closely related to a result from stochastic geometry which
states  that  the  expected  number  of  intersections  a  random  line  makes  with  an  arbitrary  rectifiable 
curve  is  proportional  to  the  length  of  the  curve.  This  result  suggests  that  the  length  of  a  curve  can 
be  estimated  (or  “learned”)  from  a  set  of  generalized  samples.  In  fact,  this  idea  has  been  studied, 
although  primarily  from  the  point  of  view  of  deterministic  sampling  [17,  11].  The  learnability  result 
makes  the  much  stronger  statement  that  for  certain  classes  of  curves,  from  just  knowing  the  number 
of  intersections  with  a  set  of  random  lines,  the  curve  itself  can  be  learned  (from  which  the  length 
can  then  be  estimated).  Also,  for  these  classes  of  curves,  the  learning  result  guarantees  uniform 
convergence  of  empirical  estimates  of  length  to  true  length,  which  does  not  follow  directly  from  the 
stochastic  geometry  result. 

Finally,  we  discuss  a  number  of  open  problems  and  directions  for  further  work.  We  believe  the 
framework presented here can be applied to a number of problems in signal/image processing, geometric
reconstruction, and stereology, to provide sample size bounds under a PAC criterion. Some
specific  problems  that  may  be  approachable  with  these  ideas  include  tomographic  reconstruction 
using  random  ray  or  projection  sampling  and  convex  set  reconstruction  from  support  line  or  other 
types  of  measurements. 


2  PAC  Learning  with  Generalized  Samples 

In  the  original  PAC  learning  model  [18],  a  concept  is  a  subset  of  some  instance  space  X,  and  a 
concept  class  C  is  a  collection  of  concepts.  The  learner  knows  C  and  tries  to  learn  a  target  concept 
c  belonging  to  C.  The  information  received  by  the  learner  consists  of  points  of  X  (drawn  randomly) 
and  labeled  as  to  whether  or  not  they  belong  to  the  target  concept.  The  goal  of  the  learner  is  to 
produce with high probability (greater than $1 - \delta$) a hypothesis which is close (within $\epsilon$) to the
target concept (hence the name PAC for “probably approximately correct”). It is assumed that
the distribution is unknown to the learner, and the number of samples needed to learn for fixed $\epsilon$
and $\delta$ is independent of the unknown concept as well as the unknown distribution (hence the term
“distribution-free”).  For  precise  definitions,  see  for  example  [18,  4]. 

Some  variations/extensions  of  the  original  model  that  have  been  studied  and  are  relevant  to  the 
present  work  include  learning  with  respect  to  a  fixed  distribution  [3,  6],  and  learning  functions  as 
opposed  to  sets  (i.e.,  binary  valued  functions)  [6].  As  the  name  suggests,  learning  with  respect  to 
a fixed distribution refers to the case in which the distribution with which the samples are being
drawn is fixed and known to the learner. A very general framework was formulated by Haussler
[6] building on some fundamental work by Vapnik and Chervonenkis [19, 20, 21], Dudley [5], and
Pollard [12]. In this framework, the concept class (hypotheses), denoted by $F$, is a collection of
functions from a domain $X$ to a range $Y$. The samples are drawn according to a distribution on
$X \times Y$ from some class of distributions. A loss function is defined on $Y \times Y$, and the goal of
the  learner  is  to  produce  a  hypothesis  from  F  which  is  close  to  the  optimal  one  in  the  sense  of 
minimizing  the  expected  loss  between  the  hypothesis  and  the  random  samples. 

Learning  from  generalized  samples  can  be  formulated  as  an  extension  of  the  framework  in  [6]  as 
briefly  described  at  the  end  of  this  section.  However,  for  simplicity  of  the  presentation  we  consider  a 
restricted  formulation  which  is  sufficiently  general  to  treat  the  example  of  learning  a  curve  discussed 
in this paper. We now define more carefully what we mean by learning from generalized samples.
Let $X$ be the original instance space as before, and let the concept class $F$ be a collection of real
valued functions on $X$. In the usual model, the information one gets consists of samples $(x, f(x))$ where
$x \in X$ and where $f \in F$ is the target concept. We can view this as obtaining a functional $\delta_x$ and
applying this functional to the target concept $f$ to obtain the sample $(\delta_x, \delta_x(f)) = (\delta_x, f(x))$. The
functional in this case simply evaluates $f$ at the point $x$, and is chosen randomly from the class of
all such “impulse” functionals. Instead, we now assume we get generalized samples in the sense that
we obtain a more general functional $\tilde{x}$ which is some mapping from $F$ to $\mathbf{R}$. The observed labeled
sample is then $(\tilde{x}, \tilde{x}(f))$, consisting of the functional and the real number obtained by applying this
functional to the target concept $f$. We assume the functional $\tilde{x}$ is chosen randomly from some
collection of functionals $\tilde{X}$. Thus, $\tilde{X}$ is the instance space for the generalized samples, and the
distribution $P$ is a probability measure on $\tilde{X}$. Let $S_F$ denote the set of labeled $m$-samples for each
$m \ge 1$, for each $\tilde{x} \in \tilde{X}$, and each $f \in F$.

Given $P$, we can define an error criterion (i.e., notion of distance between concepts) with respect
to $P$ as

$$d_P(f_1, f_2) = E\,|\tilde{x}(f_1) - \tilde{x}(f_2)|$$

This is simply the average absolute difference of real numbers produced by generalized samples on
the two concepts. Note we could define the framework with more general loss criteria as in [6], but
for  the  example  considered  in  this  paper  we  use  the  criterion  above. 
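To make the distance $d_P$ concrete, the following is a minimal Python sketch that is not part of the original development. It assumes, purely for illustration, that concepts are functions on $[0, 1]$ and that a generalized sample returns the integral of the concept over a random subinterval. The names `integral_functional`, `draw_functional`, and `estimate_d_P` are hypothetical. The distance is estimated by averaging $|\tilde{x}(f_1) - \tilde{x}(f_2)|$ over independently drawn functionals.

```python
import random


def integral_functional(a, b, n_grid=200):
    """Return a functional that maps f to its midpoint-rule integral over [a, b]."""
    def functional(f):
        h = (b - a) / n_grid
        return h * sum(f(a + (k + 0.5) * h) for k in range(n_grid))
    return functional


def draw_functional(rng):
    """Draw a generalized sample: integration over a random subinterval of [0, 1].

    This choice of distribution P is an assumption made for the example.
    """
    a, b = sorted((rng.random(), rng.random()))
    return integral_functional(a, b)


def estimate_d_P(f1, f2, n_samples=2000, seed=0):
    """Monte Carlo estimate of d_P(f1, f2) = E|x(f1) - x(f2)|."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x_tilde = draw_functional(rng)
        total += abs(x_tilde(f1) - x_tilde(f2))
    return total / n_samples
```

For the constant concepts 1 and 0, the functional returns the subinterval length, so the estimate should be close to the mean distance between two independent uniform points, namely 1/3.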

Definition 1 (Learning From Generalized Samples) Let $\mathcal{P}$ be a fixed and known collection
of probability measures. Let $F$ be a collection of functions from the instance space $X$ into $\mathbf{R}$, and
let $\tilde{X}$ be the instance space of generalized samples for $F$. $F$ is said to be learnable with respect to
$\mathcal{P}$ from the generalized samples $\tilde{X}$ if there is a mapping $A: S_F \to F$ for producing a hypothesis $h$
from a set of labeled samples such that for every $\epsilon, \delta > 0$ there is a $0 < m = m(\epsilon, \delta) < \infty$ such that
for every probability measure $P \in \mathcal{P}$ and every $f \in F$, if $h$ is the hypothesis produced from a labeled
$m$-sample drawn according to $P^m$ then the probability that $d_P(f, h) < \epsilon$ is greater than $1 - \delta$.

If $\mathcal{P}$ is the set of all distributions over some $\sigma$-algebra of $\tilde{X}$ then this corresponds to distribution-free
learning from generalized samples. If $\mathcal{P}$ consists of a single distribution $P$ then this corresponds
to fixed distribution learning from generalized samples. This is a direct extension of the usual
definition of PAC learnability (see for example [4]) to learning functions from generalized samples
over a class of distributions. In the definition we have assumed that there is an underlying target
concept $f$. As with the restrictions mentioned earlier, this could be removed following the framework
of  [6]. 

Learning with generalized samples can be easily transformed into an equivalent problem of PAC
learning from standard samples. The concept class $F$ on $X$ corresponds naturally to a concept class
$\tilde{F}$ on $\tilde{X}$ as follows. For a fixed $f \in F$, each functional $\tilde{x} \in \tilde{X}$ produces a real number when applied
to $f$. Therefore, $f$ induces a real valued function on $\tilde{X}$ in a natural way. The real valued function
on $\tilde{X}$ induced by $f$ will be denoted by $\tilde{f}$, and is defined by

$$\tilde{f}(\tilde{x}) = \tilde{x}(f)$$

The concept class $\tilde{F}$ is the collection of all functions on $\tilde{X}$ obtained in this way as $f$ ranges through
$F$.

We are now in the standard PAC framework with instance space $\tilde{X}$, concept class $\tilde{F}$, and
distribution $P$ on $\tilde{X}$. Hence, as usual, $P$ induces a learning criterion or metric (actually only a
pseudo-metric in general) on $\tilde{F}$, and as a result of the correspondence between $F$ and $\tilde{F}$, this metric
is equivalent to the (pseudo-)metric $d_P$ induced by $P$ on $F$ defined above. This metric will be
denoted by $d_P$ over both $F$ and $\tilde{F}$, and is given by

$$d_P(\tilde{f}_1, \tilde{f}_2) = E\,|\tilde{f}_1 - \tilde{f}_2| = E\,|\tilde{x}(f_1) - \tilde{x}(f_2)| = d_P(f_1, f_2)$$

Distribution-free and fixed distribution learnability are defined in the usual way for $\tilde{X}$ and $\tilde{F}$.
Thus, a generalized notion of VC dimension for functions (called pseudo dimension in [6]) and
metric entropy of $\tilde{F}$ characterize the learnability of $\tilde{F}$ in the distribution-free and fixed distribution
cases respectively. These same quantities for $\tilde{F}$ then also characterize the learnability of $F$ with
respect to $d_P$.

Definition 2 (Metric Entropy) Let $(Y, \rho)$ be a metric space. A set $Y^{(\epsilon)}$ is said to be an $\epsilon$-cover
(or $\epsilon$-approximation) for $Y$ if for every $y \in Y$ there exists $y' \in Y^{(\epsilon)}$ such that $\rho(y, y') \le \epsilon$. Define
$N(\epsilon) \equiv N(\epsilon, Y, \rho)$ to be the smallest integer $n$ such that there exists an $\epsilon$-cover for $Y$ with $n$ elements.
If no such $n$ exists, then $N(\epsilon, Y, \rho) = \infty$. The metric entropy of $Y$ (often called the $\epsilon$-entropy) is
defined to be $\log_2 N(\epsilon)$.


$N(\epsilon)$ represents the smallest number of balls of radius $\epsilon$ which are required to cover $Y$. For
convenience, if $P$ is a distribution we will use the notation $N(\epsilon, C, P)$ (instead of $N(\epsilon, C, d_P)$), and
we will speak of the metric entropy of $C$ with respect to $P$, with the understanding that the metric
being used is $d_P(\cdot, \cdot)$.

Using results from [6] (based on results from [12]), we have the following result for learning from
generalized samples with respect to a fixed distribution.


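Theorem 1 $F$ is learnable from generalized samples (or equivalently, $\tilde{F}$ is learnable) with respect
to a distribution $P$ if for each $\epsilon > 0$ there is a finite $\epsilon$-cover $\tilde{F}^{(\epsilon)}$ for $\tilde{F}$ (with respect to $d_P$) such
that $0 \le \tilde{f}_i \le M(\epsilon)$ for each $\tilde{f}_i \in \tilde{F}^{(\epsilon)}$. Furthermore, a sample size

$$m(\epsilon, \delta) \ge \frac{2M^2(\epsilon/2)}{\epsilon^2} \ln \frac{2\,|\tilde{F}^{(\epsilon/2)}|}{\delta}$$

is sufficient for $\epsilon, \delta$ learnability.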


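Proof: Let $\tilde{F}^{(\epsilon/2)}$ be an $\frac{\epsilon}{2}$-cover with $0 \le \tilde{f}_i \le M(\epsilon/2)$ for each $\tilde{f}_i \in \tilde{F}^{(\epsilon/2)}$. Let $F^{(\epsilon/2)}$ be obtained
from $\tilde{F}^{(\epsilon/2)}$ using the correspondence between $F$ and $\tilde{F}$. After seeing $m(\epsilon, \delta)$ samples, let
the learning algorithm output a hypothesis $h \in F^{(\epsilon/2)}$ which is most consistent with the data, i.e.,
which minimizes

$$\sum_{i=1}^{m(\epsilon, \delta)} |\tilde{h}(\tilde{x}_i) - y_i|$$

where $(\tilde{x}_i, y_i)$ are the observed generalized samples. Then using Theorem 1 of [6], it follows that
with probability greater than $1 - \delta$ we have $d_P(f, h) < \epsilon$.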


□ 
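The selection rule used in the proof can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: it assumes the finite cover is given as a list of callables acting on generalized samples, and it takes "most consistent with the data" to mean smallest total absolute deviation from the observed labels, in line with the absolute-difference criterion $d_P$. The names `empirical_error` and `most_consistent` are hypothetical.

```python
def empirical_error(h_tilde, samples):
    """Total absolute deviation of hypothesis h_tilde on labeled samples (x, y)."""
    return sum(abs(h_tilde(x) - y) for x, y in samples)


def most_consistent(cover, samples):
    """Return the element of a finite cover with the smallest empirical error.

    cover   -- list of callables, each mapping a generalized sample to a real number
    samples -- list of pairs (x, y), where y is the label observed for sample x
    """
    return min(cover, key=lambda h_tilde: empirical_error(h_tilde, samples))
```

Because the search is over a finite cover, the cost of this step is the cover size times the number of samples. The sketch makes no attempt to be efficient.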

Although  we  will  not  use  distribution-free  learning  in  the  example  of  learning  a  curve,  for 
completeness  we  give  a  result  for  this  case. 

Definition 3 (Pseudo Dimension) Let $F$ be a collection of functions from a set $Y$ to $\mathbf{R}$. For
any set of points $y = (y_1, \ldots, y_d)$ from $Y$, let $F|_y = \{(f(y_1), \ldots, f(y_d)) : f \in F\}$. $F|_y$ is a set of
points in $\mathbf{R}^d$. If there is some translation of $F|_y$ which intersects all of the $2^d$ orthants of $\mathbf{R}^d$ then $y$
is said to be shattered by $F$. Following terminology from [6], the pseudo dimension of $F$, which we
denote $\dim(F)$, is the largest integer $d$ such that there exists a set of $d$ points in $Y$ that is shattered
by $F$. If no such largest integer exists then $\dim(F)$ is infinite.

We have the following result for distribution-free learning from generalized samples, again using
results from [6].

Theorem 2 $F$ is distribution-free learnable from generalized samples (or equivalently, $\tilde{F}$ is
distribution-free learnable) if for some $M < \infty$ we have $0 \le \tilde{f} \le M$ for every $\tilde{f} \in \tilde{F}$ and if
$\dim(\tilde{F}) = d$ for some $1 \le d < \infty$. Furthermore, a sample size

$$m(\epsilon, \delta) \ge \frac{64M^2}{\epsilon^2} \left( 2d \ln \frac{16eM}{\epsilon} + \ln \frac{8}{\delta} \right)$$

is sufficient for $\epsilon, \delta$ distribution-free learnability.


Proof: The result follows from a direct application of Corollary 2 from [6], together with the
correspondence between $F$ and $\tilde{F}$ and the fact that $d_P(f_1, f_2) = d_P(\tilde{f}_1, \tilde{f}_2)$.


□

Note that the metric entropy of $F$ is identical to the metric entropy of $\tilde{F}$ (since both are with
respect to $d_P$), so that the metric entropy of $F$ characterizes learnability for a fixed distribution as
well. However, the pseudo dimension of $F$ with respect to $X$ does not characterize distribution-free
learnability. This quantity can be very different from the pseudo dimension of $\tilde{F}$ with respect to
$\tilde{X}$.

As mentioned above, for simplicity we have defined the concepts to be real valued functions,
have chosen the generalized samples to return real values, and have selected a particular form for
the learning criterion or metric $d_P$. Our ideas can easily be formulated in the much more general
framework considered by Haussler [6]. Specifically, one could take $F$ to be a family of functions
with domain $X$ and range $Y$. The generalized samples $\tilde{X}$ would be a collection of mappings from
$F$ to $\tilde{Y}$. A family of functions $\tilde{F}$ mapping $\tilde{X}$ to $\tilde{Y}$ would be obtained from $F$ by assigning to each
$f \in F$ an $\tilde{f} \in \tilde{F}$ defined by $\tilde{f}(\tilde{x}) = \tilde{x}(f)$. As in [6], the distributions would be defined on $\tilde{X} \times \tilde{Y}$, a
loss function $L$ would be defined on $\tilde{Y} \times \tilde{Y}$, and for each $\tilde{f} \in \tilde{F}$ the error of the hypothesis $\tilde{f}$ with
respect to a distribution would be $E\,L(\tilde{f}(\tilde{x}), y)$ where the expectation is over the distribution on
$(\tilde{x}, y)$.

Although  learning  with  generalized  samples  is  in  essence  simply  a  transformation  to  a  different 
standard learning problem, it allows the learning framework and results to be applied to a broad range
of  problems.  To  show  the  variety  in  the  type  of  observations  that  are  available,  we  briefly  mention 
some  types  of  generalized  samples  that  may  be  of  interest  in  certain  applications.  In  the  case  where 
the  concepts  are  subsets  of  X  (i.e.,  binary  valued  functions),  some  interesting  generalized  samples 
might  be  to  draw  random  (parameterized)  subsets  (e.g.,  disks,  lines,  or  other  parameterized  curves) 
of X labeled as to whether or not the random set intersects or is contained in the target concept.
Alternatively,  the  random  set  could  be  labeled  as  to  the  number  of  intersections  (or  length,  area, 
or  volume  of  the  intersection,  as  appropriate).  In  the  case  where  the  concepts  are  real  valued 
functions,  one  might  consider  generalized  samples  consisting  of  certain  random  sets  and  returning 
the  integral  of  the  concept  over  these  sets.  For  example,  drawing  random  lines  would  correspond 
to  tomographic  type  problems  with  random  ray  sampling.  Other  possibilities  might  be  to  return 
weighted  integrals  of  the  concept  where  the  weighting  function  is  selected  randomly  from  a  suitable 
set (e.g., an orthonormal basis), or to sample derivatives of the concept at random points.

3 A Result From Stochastic Geometry


In this section we state an interesting and well known result from stochastic geometry. This result
will be used in the next section in connection with a specific example of learning from generalized
samples. 

To  state  the  result,  we  first  need  to  describe  the  notion  of  drawing  a  “random”  straight  line, 
i.e.,  a  uniform  distribution  for  the  set  of  straight  lines  intersecting  a  bounded  domain.  A  line  in  the 
plane will be parameterized by the polar coordinates $r, \theta$ of the point on the line which is closest to
the origin, where $r \ge 0$ and $0 \le \theta < 2\pi$. The set (manifold) of all lines in the plane parameterized
in this way corresponds to a semi-infinite cylinder.

A well known result from stochastic geometry states that the unique measure (up to a scale
factor) on the set of lines which is invariant to rigid transformations of the plane (translation,
rotation) is $dr\,d\theta$, i.e., uniform density in $r$ and $\theta$. This measure is thus independent of the choice
of coordinate system, and is referred to as the uniform measure (or density) for the set of straight
lines in the plane. This measure corresponds precisely to the surface area measure on the cylinder.

From this measure, a uniform probability measure can be obtained for the set of all straight
lines intersecting a bounded domain. Specifically, the set of straight lines intersecting a bounded
domain $X$, which we will denote by $\tilde{X}$, is a bounded subset of the cylinder. The uniform probability
measure on $\tilde{X}$ is then just the surface area measure of the cylinder suitably normalized (i.e., by the
area of $\tilde{X}$).
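As a numerical illustration of this normalization (a sketch under stated assumptions, not part of the original text), the code below integrates the measure $dr\,d\theta$ over the set of lines that meet the unit square. It assumes the origin is placed at the lower-left corner of the square and describes a line by the pair $(r, \theta)$ as the set of points $p$ with $p \cdot (\cos\theta, \sin\theta) = r$. The function names are hypothetical. The value obtained is the normalizing constant of the uniform probability measure.

```python
import math

# Corners of the unit square, with the origin at the lower-left corner (an assumption).
CORNERS = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]


def r_interval(theta):
    """Range of r >= 0 for which the line with parameters (r, theta) meets the square."""
    ux, uy = math.cos(theta), math.sin(theta)
    proj = [px * ux + py * uy for px, py in CORNERS]
    lo, hi = max(0.0, min(proj)), max(proj)
    return (lo, hi) if hi > lo else (0.0, 0.0)


def measure_of_lines_hitting_square(n_theta=20000):
    """Midpoint-rule integral of dr dtheta over the lines meeting the unit square."""
    h = 2.0 * math.pi / n_theta
    total = 0.0
    for k in range(n_theta):
        lo, hi = r_interval((k + 0.5) * h)
        total += (hi - lo) * h
    return total
```

The integral evaluates to approximately 4, the perimeter of the unit square. This agrees with the classical fact that the invariant measure of the lines meeting a bounded convex set equals its perimeter.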

We  can  now  state  the  following  classic  result  from  stochastic  geometry  (see  e.g.  [14,  2]). 

Theorem 3 Let $X$ be a bounded convex subset of $\mathbf{R}^2$, and let $c \subset X$ be a rectifiable curve. Suppose
lines intersecting $X$ are drawn uniformly, and let $n(\tilde{x}, c)$ denote the number of intersections of the
random line $\tilde{x}$ with the curve $c$. Then

$$E\,n(\tilde{x}, c) = \frac{2\ell(c)}{A}$$

where $\ell(c)$ denotes the length of the curve $c$ and $A$ is the perimeter of $X$.

In the next section, for simplicity we will take $X$ to be the unit square. In this case, the theorem
reduces simply to $E\,n(\tilde{x}, c) = \frac{1}{2}\ell(c)$.
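The unit-square case can be checked by simulation. The following Python sketch is an illustration under assumptions, not the authors' procedure: it places the origin at the lower-left corner of the square, draws lines uniformly in $(r, \theta)$ by rejection sampling, and counts transversal crossings with a piecewise linear curve given by its vertices. The names `random_line`, `count_intersections`, and `mean_intersections` are hypothetical.

```python
import math
import random

CORNERS = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]


def random_line(rng):
    """Rejection-sample (r, theta) uniformly among lines that meet the unit square."""
    while True:
        theta = rng.uniform(0.0, 2.0 * math.pi)
        r = rng.uniform(0.0, math.sqrt(2.0))
        ux, uy = math.cos(theta), math.sin(theta)
        proj = [px * ux + py * uy for px, py in CORNERS]
        if min(proj) <= r <= max(proj):
            return r, theta


def count_intersections(line, vertices):
    """Number of segments of the polyline crossed by the line (ties have measure zero)."""
    r, theta = line
    ux, uy = math.cos(theta), math.sin(theta)
    side = [px * ux + py * uy - r for px, py in vertices]
    return sum(1 for s0, s1 in zip(side, side[1:]) if s0 * s1 < 0)


def mean_intersections(vertices, n_lines=20000, seed=0):
    """Monte Carlo estimate of the expected number of intersections."""
    rng = random.Random(seed)
    total = sum(count_intersections(random_line(rng), vertices) for _ in range(n_lines))
    return total / n_lines
```

For the L-shaped polyline with vertices (0.1, 0.1), (0.9, 0.1), (0.9, 0.9), whose length is 1.6, the estimate should be close to 0.8, i.e., half the length.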

A  surprising  (and  powerful)  aspect  of  this  theorem  is  that  the  expected  number  of  intersections 
a  random  line  makes  with  the  curve  c  depends  only  on  the  length  of  c  but  is  independent  of  any 
other  geometric  properties  of  c.  In  fact,  the  expression  on  the  left  hand  side  (suitably  normalized) 
can be used as a definition for the length (or one-dimensional measure) of general sets in the plane
[15]. 

An interesting implication of Theorem 3 is that the length of an unknown curve can be estimated
or “learned” if one is told the number of intersections between the unknown curve and a collection
of lines drawn randomly (from the uniform distribution). In fact, deterministic versions of this idea
have been studied [17, 11].




4  Learning  a  Curve  by  Counting  Intersections  with  Lines 


In this section, we consider a particular example of learning from generalized samples. For concreteness
we take $X$ to be the unit square in $\mathbf{R}^2$, although our results easily extend to the case
where $X$ is any bounded convex domain in $\mathbf{R}^2$. We will consider concept classes $C$ which are
collections of curves contained in $X$. For example, one particular concept class of interest will be
the set of straight line segments contained in $X$. Other concept classes will consist of more general
curves in $X$ satisfying certain regularity constraints. The samples observed by the learner consist
of randomly drawn straight lines labeled as to the number of intersections the random line makes
with the target concept (i.e., the unknown curve). Recall that with the $r, \theta$ parameterization, the
set of lines intersecting $X$, which is the instance space $\tilde{X}$, is a bounded subset of the semi-infinite
cylinder. We consider learnability with respect to a fixed distribution, where the distribution $P$ is
the uniform distribution on $\tilde{X}$.

4.1  Learning  a  Line  Segment 

Consider the case where $C$ is the set of straight line segments in $X$. In this case, given a concept
$c \in C$, every straight line (except for a set of measure zero) intersects $c$ either exactly once or not
at all. Thus, $\tilde{C}$ consists of subsets (i.e., binary valued functions) of $\tilde{X}$, where each $\tilde{c} \in \tilde{C}$ contains
exactly those straight lines $\tilde{x} \in \tilde{X}$ which intersect the corresponding $c \in C$.

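The metric $d_P$ on $C$ and $\tilde{C}$ induced by $P$ is given by

$$d_P(c_1, c_2) = d_P(\tilde{c}_1, \tilde{c}_2) = E\,|n(\tilde{x}, c_1) - n(\tilde{x}, c_2)| = P(\tilde{c}_1 \Delta \tilde{c}_2)$$

where, as in the previous section, $n(\tilde{x}, c)$ is the number of intersections the line $\tilde{x}$ makes with $c$. In
the case of line segments $n(\tilde{x}, c)$ is either one or zero, i.e., $\tilde{c}$ is binary valued, so that

$$d_P(c_1, c_2) = d_P(\tilde{c}_1, \tilde{c}_2) = P(\tilde{c}_1 \Delta \tilde{c}_2)$$

where $\tilde{c}_1 \Delta \tilde{c}_2$ is the usual symmetric difference of $\tilde{c}_1$ and $\tilde{c}_2$.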

In  the  case  of  line  segments,  a  simple  bound  on  the  dp  distance  between  two  segments  can  be 
obtained  in  terms  of  the  distances  between  the  endpoints  of  the  segments. 

Lemma 1 Let $c_1, c_2$ be two line segments, and let $a_1, b_1$ and $a_2, b_2$ be the endpoints of $c_1$ and $c_2$
respectively. Then

$$d_P(c_1, c_2) \le \frac{1}{2} \left( \|a_1 - a_2\| + \|b_1 - b_2\| \right)$$

Proof: Since $c_1, c_2$ are line segments, the distance $d_P(c_1, c_2)$ between $c_1$ and $c_2$ is the probability
that a random line intersects exactly one of $c_1$ and $c_2$. Any line that intersects exactly one of $c_1, c_2$
must intersect one of the segments $a_1 a_2$ or $b_1 b_2$ joining the endpoints of $c_1$ and $c_2$. Therefore,

$$d_P(c_1, c_2) \le P(\tilde{x} \cap a_1 a_2 \ne \emptyset \text{ or } \tilde{x} \cap b_1 b_2 \ne \emptyset) \le P(\tilde{x} \cap a_1 a_2 \ne \emptyset) + P(\tilde{x} \cap b_1 b_2 \ne \emptyset)$$

Using Theorem 3, the probability that a random line intersects a line segment in the unit square
is simply half the length of the line segment, from which the result follows.


□

Using Lemma 1, we can bound the metric entropy of $C$ (and hence $\tilde{C}$) with respect to the
metric induced by $P$.

Lemma 2 Let $C$ be the set of line segments contained in the unit square $X$, and let $P$ be the
uniform distribution on the set of lines intersecting $X$. Then

$$N(\epsilon, C, P) = N(\epsilon, \tilde{C}, P) \le \frac{1}{4\epsilon^4}$$

Proof: We construct an $\epsilon$-cover for $C$ as follows. Consider a rectangular grid of points in $X$ with
spacing $\sqrt{2}\epsilon$. Let $C^{(\epsilon)}$ be the set of all line segments with endpoints on this grid. There are $\frac{1}{2\epsilon^2}$
points in the grid, so that there are $\frac{1}{4\epsilon^4}$ line segments in $C^{(\epsilon)}$. (We ignore the fact that some of
these segments are actually just points, since there are just $\frac{1}{2\epsilon^2}$ of these.) For any $c \in C$, there is
a $c' \in C^{(\epsilon)}$ such that each endpoint of $c'$ is within $\epsilon$ of an endpoint of $c$. Hence, from Lemma 1
$d_P(c, c') \le \frac{1}{2}(\epsilon + \epsilon) = \epsilon$ so that $C^{(\epsilon)}$ is an $\epsilon$-cover for $C$ with $\frac{1}{4\epsilon^4}$ elements.


□ 
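The grid construction can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code: grid points are placed at cell centers so that every point of the square lies within $\epsilon$ of a grid point, and the number of cells per side is rounded up to an integer, so the grid may contain slightly more than $1/(2\epsilon^2)$ points. All function names are hypothetical. The check uses the endpoint bound $\frac{1}{2}(\|a_1 - a_2\| + \|b_1 - b_2\|)$ on the distance between two segments.

```python
import math


def grid_axis(eps):
    """Cell-centered grid coordinates along one side, with spacing at most sqrt(2)*eps."""
    k = math.ceil(1.0 / (math.sqrt(2.0) * eps))
    return [(i + 0.5) / k for i in range(k)]


def snap(point, axis):
    """Nearest grid point to the given point of the unit square."""
    return tuple(min(axis, key=lambda g: abs(g - coord)) for coord in point)


def cover_element(segment, eps):
    """Segment with grid endpoints that approximates the given segment (a, b)."""
    axis = grid_axis(eps)
    a, b = segment
    return snap(a, axis), snap(b, axis)


def endpoint_bound(seg1, seg2):
    """Half the sum of the endpoint distances, an upper bound on the segment distance."""
    (a1, b1), (a2, b2) = seg1, seg2
    return 0.5 * (math.dist(a1, a2) + math.dist(b1, b2))
```

For any segment in the unit square, `endpoint_bound(segment, cover_element(segment, eps))` is at most `eps`, since each endpoint moves by at most half the diagonal of a grid cell.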

The  construction  of  the  previous  lemma  allows  us  to  obtain  the  following  learning  result  for 
straight  line  segments. 

Theorem 4 Let $C$ be the set of line segments in the unit square $X$. Then $C$ is learnable by counting
intersections with straight lines drawn uniformly using

$$m(\epsilon, \delta) = \frac{2}{\epsilon^2} \ln \frac{8}{\epsilon^4 \delta}$$

samples.

Proof: Let $\tilde{C}$ be the concept class over $\tilde{X}$ corresponding to $C$. Then $\tilde{c} \in \tilde{C}$ is defined by $\tilde{c}(\tilde{x}) =
n(\tilde{x}, c)$, i.e., $\tilde{c}(\tilde{x})$ is the number of intersections of the line $\tilde{x}$ with $c$. Clearly, $0 \le \tilde{c} \le 1$ (except for
a set of measure zero) for every $\tilde{c} \in \tilde{C}$. Using the construction of Lemma 2, we have an $\frac{\epsilon}{2}$-cover of
$\tilde{C}$ with $4/\epsilon^4$ elements. Hence, the result follows from Theorem 1.


□ 
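To get a feel for the magnitude of this bound, the following Python sketch evaluates it numerically. It assumes that the fixed-distribution sample size has the form $\frac{2M^2}{\epsilon^2} \ln \frac{2N}{\delta}$, with $M = 1$ and with $N = 4/\epsilon^4$ the size of the $\frac{\epsilon}{2}$-cover. That form is inferred from the surrounding results and should be treated as an assumption. The function name is hypothetical.

```python
import math


def samples_for_segments(eps, delta):
    """Number of random lines suggested by the assumed bound (2/eps^2) ln(8/(eps^4 delta))."""
    cover_size = 4.0 / eps ** 4   # size of the eps/2-cover of line segments
    M = 1.0                       # intersection counts with a segment are 0 or 1
    return math.ceil(2.0 * M ** 2 / eps ** 2 * math.log(2.0 * cover_size / delta))
```

For example, with `eps = 0.1` and `delta = 0.05` the function returns a value a little under 2900. The dependence on `delta` is only logarithmic, while the dependence on `eps` is essentially quadratic.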


4.2  Learning  Curves  of  Bounded  Turn  and  Length 

Now  we  consider  the  learnability  of  a  much  more  general  class  of  curves.  First  we  need  some 
preliminary  definitions.  We  will  consider  rectifiable  curves  parameterized  by  arc  length  s,  so  that 
a  curve  c  of  length  L  is  given  by 


$$c = \{(x_1(s), x_2(s)) \mid 0 \le s \le L\}$$

where $x_1(\cdot)$ and $x_2(\cdot)$ are absolutely continuous functions from $[0, L]$ to $\mathbf{R}$ such that $\sqrt{\dot{x}_1^2 + \dot{x}_2^2}$ is
defined and equal to unity almost everywhere. If $x_1$ and $x_2$ are twice-differentiable at $s$, then the
curvature of $c$ at $s$, $\kappa(s)$, is defined as the rate of change of the direction of the tangent to the curve
at $s$, and is given by $\kappa(s) = \ddot{x}_2 \dot{x}_1 - \ddot{x}_1 \dot{x}_2$. The total absolute curvature of $c$ is given by $\int_0^L |\kappa(s)|\, ds$.
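As a quick check of these formulas, consider a circular arc of radius $R$ and length $L$ parameterized by arc length. This worked example is added for illustration and is not part of the original text.

```latex
\begin{align*}
x_1(s) &= R\cos(s/R), & x_2(s) &= R\sin(s/R), \\
\dot{x}_1 &= -\sin(s/R), & \dot{x}_2 &= \cos(s/R), \\
\ddot{x}_1 &= -\tfrac{1}{R}\cos(s/R), & \ddot{x}_2 &= -\tfrac{1}{R}\sin(s/R), \\
\kappa(s) &= \ddot{x}_2\dot{x}_1 - \ddot{x}_1\dot{x}_2
           = \tfrac{1}{R}\bigl(\sin^2(s/R) + \cos^2(s/R)\bigr) = \tfrac{1}{R}, \\
\int_0^L |\kappa(s)|\,ds &= \frac{L}{R}.
\end{align*}
```

The speed $\sqrt{\dot{x}_1^2 + \dot{x}_2^2}$ equals one, as required, and the total absolute curvature $L/R$ is the angle subtended by the arc. A full circle therefore has total absolute curvature $2\pi$.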

Alexandrov and Reshetnyak [1] have developed an interesting theory for irregular curves. Among
other things, they study the notion of the “turn” of a curve, which is a generalization of
total absolute curvature to curves which are not necessarily twice-differentiable. For example, for
a piecewise linear curve the turn is simply the sum of the absolute angles that the tangent turns
between adjacent segments. The turn for more general curves can be obtained by piecewise linear
approximations. In fact, this is precisely the manner in which turn is defined [1].

Definition 4 (Turn) Let $v_0 v_1 \cdots v_n$ denote a piecewise linear curve with vertices $v_0, \ldots, v_n$. Let $a_i$
be the vector $\overrightarrow{v_{i-1} v_i}$, and let $\phi_i$ be the angle between the vectors $a_i$ and $a_{i+1}$. That is, $\phi_i$ represents
the total angle through which the tangent to the curve turns at vertex $i$ ($\pi$ minus the interior angle
at vertex $i$). The turn of $v_0 \cdots v_n$, denoted $\kappa(v_0 \cdots v_n)$, is defined by

$$\kappa(v_0 \cdots v_n) = \sum_{i=1}^{n-1} \phi_i$$

The turn of a general parameterized curve $c$, denoted $\kappa(c)$, is defined as the supremum of the turn
over all piecewise linear curves inscribed in $c$. I.e.,

$$\kappa(c) = \sup\{\kappa(c(s_0) \cdots c(s_n)) \mid 0 \le s_0 < s_1 < \cdots < s_n \le L\}$$

where $L$ is the length of $c$.

As  expected,  the  notion  of  turn  reduces  to  the  total  absolute  curvature  of  a  curve  when  the  latter 
quantity  is  defined  [1].  We  will  use  the  generalized  notion  of  turn  throughout,  so  that  our  results 
will  apply  to  curves  which  are  not  necessarily  twice  differentiable  (e.g.,  piecewise  linear  curves). 
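For a piecewise linear curve the turn is straightforward to compute. The Python sketch below is an illustration (the function name `turn` is hypothetical): it sums the unsigned angle between consecutive edge vectors at each interior vertex, using `atan2` of the cross and dot products.

```python
import math


def turn(vertices):
    """Turn of the open polyline through the given (x, y) vertices, in radians."""
    total = 0.0
    for p, q, r in zip(vertices, vertices[1:], vertices[2:]):
        ax, ay = q[0] - p[0], q[1] - p[1]   # edge arriving at vertex q
        bx, by = r[0] - q[0], r[1] - q[1]   # edge leaving vertex q
        cross = ax * by - ay * bx
        dot = ax * bx + ay * by
        total += abs(math.atan2(cross, dot))  # unsigned turning angle in [0, pi]
    return total
```

A straight polyline has turn zero, a single right-angle corner contributes $\pi/2$, and traversing the boundary of a square from one corner back to the same corner gives $3\pi/2$, because the start and end vertex is not an interior vertex of the open curve.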

We will consider classes of curves of bounded length and bounded turn. Specifically, let $C_{K,L}$
be the set of all curves contained in the unit square whose length is less than or equal to $L$ and
whose turn is less than or equal to $K$. Note that for curves contained in a bounded domain, the
length of a curve can be bounded in terms of the turn of the curve and the diameter of the domain
(Theorem 5.6.1 from [1]; for differentiable curves see for example [14] p. 35). Hence, we really need
only consider classes of curves with a bound on the turn. However, for convenience we will carry
both parameters $K$ and $L$ explicitly.

As before, the samples will be random lines drawn according to the uniform distribution $P$ on
$X$, labeled as to the number of intersections the line makes with the unknown curve $c$. However,
with curves in $C_{K,L}$ the number of intersections with a given line can be any nonnegative integer
as opposed to just zero or one for straight line segments. (Note that by Theorem 3, the probability
that a random line has an infinite number of intersections with a given curve is zero, so that the
number of intersections is a well defined integer valued function.) Thus, the class $C_{K,L}$ consists
of a collection of integer valued functions on $X$ as opposed to just subsets of $X$ as in the previous
section.

Also, as before, the results on learning for the set of curves will be with respect to the metric
$d_P$ induced by the measure $P$. That is, the $d_P$ distance between two curves $c_1$ and $c_2$, or
equivalently between their corresponding intersection-count functions, is given by

$$d_P(c_1, c_2) = E|n(x, c_1) - n(x, c_2)|$$

where the expectation is taken over the random line $x$ with respect to the uniform measure $P$. This
notion of distance between curves has been studied previously (e.g., see [17] and [14] p. 38). For
example, it is known that $d_P$ is in fact a metric on the set of rectifiable curves, so that $d_P$
satisfies the triangle inequality and $d_P(c_1, c_2) = 0$ implies $c_1 = c_2$. (Note that in the
references [14, 17] the notion of distance used is actually a constant multiple of $d_P$, but this
makes no difference in the metric properties.)
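To make the metric concrete, here is a hedged Monte Carlo sketch that estimates $E|n(x, c_1) - n(x, c_2)|$ for two piecewise linear curves. It assumes the unit square is $[0,1]^2$, represents a line by a unit normal and an offset, and samples lines by rejection from the invariant measure $dp\,d\theta$; the function names and the sampling scheme are choices made for this example, not part of the paper.

```python
import math
import random

def random_line(rng):
    """Draw a line uniformly (measure dp dtheta) among lines hitting [0,1]^2.

    Returns (ux, uy, p), describing the line {(x, y): ux*x + uy*y = p}.
    """
    half_diag = math.sqrt(2) / 2
    while True:
        theta = rng.uniform(0.0, math.pi)
        r = rng.uniform(-half_diag, half_diag)  # signed distance from the square's centre
        ux, uy = math.cos(theta), math.sin(theta)
        # The line meets the square iff |r| is at most the square's support function.
        if abs(r) <= (abs(ux) + abs(uy)) / 2:
            return ux, uy, r + 0.5 * (ux + uy)

def crossings(line, curve):
    """Number of segments of the polyline `curve` that the line crosses."""
    ux, uy, p = line
    side = [ux * x + uy * y - p for x, y in curve]
    return sum(1 for a, b in zip(side, side[1:]) if a * b < 0)

def estimate_dp(c1, c2, samples=100_000, seed=0):
    """Monte Carlo estimate of E|n(x, c1) - n(x, c2)| over random lines x."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        line = random_line(rng)
        total += abs(crossings(line, c1) - crossings(line, c2))
    return total / samples
```

For a two-segment curve and the chord joining its endpoints, the estimate should be close to half the difference of their lengths, which is the value that Lemma 4 below gives exactly.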

To obtain a learning result for $C_{K,L}$ we will show that each curve in $C_{K,L}$ can be approximated
(with respect to $d_P$) by a bounded number of straight line segments. The metric entropy computation
for a single straight line segment can be extended to provide a metric entropy bound for
curves consisting of a bounded number of straight line segments. Thus, by combining these two
ideas we can obtain a metric entropy bound for $C_{K,L}$ which yields the desired learning result.

First, we need several properties of the $d_P$ metric for curves of bounded turn.

Lemma 3 If $c_1, c_2$ are curves with a common endpoint (so that $c_1 \cup c_2$ is a curve) and
similarly for $c_1', c_2'$, then

$$d_P(c_1 \cup c_2, c_1' \cup c_2') \le d_P(c_1, c_1') + d_P(c_2, c_2')$$

Proof: For any line $x$ (except for a set of measure zero), $n(x, c_1 \cup c_2) = n(x, c_1) + n(x, c_2)$
and similarly for $c_1', c_2'$. Therefore,

$$\begin{aligned}
d_P(c_1 \cup c_2, c_1' \cup c_2') &= E|n(x, c_1 \cup c_2) - n(x, c_1' \cup c_2')| \\
&= E|n(x, c_1) - n(x, c_1') + n(x, c_2) - n(x, c_2')| \\
&\le E|n(x, c_1) - n(x, c_1')| + E|n(x, c_2) - n(x, c_2')| \\
&= d_P(c_1, c_1') + d_P(c_2, c_2')
\end{aligned}$$

□ 

By induction, this result can clearly be extended to unions of any finite number of curves. The
case of a finite number of curves will be used in Lemma 6 below.

Lemma 4 If $c$ is a curve and $\bar{c}$ is the line segment joining the endpoints of $c$, then

$$d_P(c, \bar{c}) = \tfrac{1}{2}\left(\mathcal{L}(c) - \mathcal{L}(\bar{c})\right)$$

Proof: Each line can intersect $\bar{c}$ at most once, and every line intersecting $\bar{c}$ also
intersects $c$. Therefore, $n(x, c) \ge n(x, \bar{c})$ so that
$|n(x, c) - n(x, \bar{c})| = n(x, c) - n(x, \bar{c})$ for all lines $x$ (except a set
of measure zero). Hence,

$$d_P(c, \bar{c}) = E|n(x, c) - n(x, \bar{c})| = E\left(n(x, c) - n(x, \bar{c})\right) = \tfrac{1}{2}\mathcal{L}(c) - \tfrac{1}{2}\mathcal{L}(\bar{c})$$

where the last equality follows from the stochastic geometry result (Theorem 3).

□ 

We will make use of the following result from [1].

Theorem 5 (Alexandrov and Reshetnyak) Let $c$ be a curve in $\mathbb{R}^n$ with $\kappa(c) < \pi$,
and let $a$ be the distance between its endpoints. Then

$$\mathcal{L}(c) \le \frac{a}{\cos(\kappa(c)/2)}$$

Equality is obtained iff $c$ consists of two line segments of equal length.


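Lemma 5 For $0 \le \alpha \le \pi/6$, $1/\cos\alpha - 1 \le \alpha^2$, so that if $c$ is a curve with
turn $\kappa(c) \le \pi/6$ and $\bar{c}$ is the line segment connecting the endpoints of $c$ then

$$d_P(c, \bar{c}) \le \tfrac{1}{8}\,\mathcal{L}(c)\,\kappa^2(c)$$

Proof: Let $g(\alpha) = 1/\cos\alpha$ and $h(\alpha) = 1 + \alpha^2$. For $0 \le \alpha \le \pi/6$,
$\sin\alpha \le 1/2$ and $\cos\alpha \ge \sqrt{3}/2$, so that

$$g''(\alpha) = \frac{2\sin^2\alpha}{\cos^3\alpha} + \frac{1}{\cos\alpha} \le \frac{4}{3\sqrt{3}} + \frac{2}{\sqrt{3}} = \frac{10}{3\sqrt{3}} < 2 = h''(\alpha)$$

Combining $g''(\alpha) < h''(\alpha)$ with the fact that $g'(0) = h'(0)$ and $g(0) = h(0)$ gives
$g(\alpha) \le h(\alpha)$, and so $1/\cos\alpha - 1 \le \alpha^2$ for $0 \le \alpha \le \pi/6$.

Now, using the above result, Lemma 4, and Theorem 5 we have

$$d_P(c, \bar{c}) = \tfrac{1}{2}\left(\mathcal{L}(c) - \mathcal{L}(\bar{c})\right) \le \tfrac{1}{2}\mathcal{L}(\bar{c})\left(\frac{1}{\cos(\kappa(c)/2)} - 1\right) \le \tfrac{1}{8}\,\mathcal{L}(\bar{c})\,\kappa^2(c) \le \tfrac{1}{8}\,\mathcal{L}(c)\,\kappa^2(c)$$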


□ 
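The inequality in Lemma 5 is easy to check numerically. The sketch below assumes the bound reads $d_P(c, \bar{c}) \le \frac{1}{8}\mathcal{L}(c)\kappa^2(c)$ and tests it on the extremal curve of the Alexandrov and Reshetnyak theorem (two equal segments); the helper names are hypothetical.

```python
import math

def lemma5_gap(alpha):
    """alpha^2 - (1/cos(alpha) - 1); nonnegative wherever the inequality holds."""
    return alpha ** 2 - (1.0 / math.cos(alpha) - 1.0)

def tent_distance(chord, kappa):
    """d_P between a symmetric two-segment curve with turn kappa and its chord.

    The curve length is chord / cos(kappa / 2), the equality case of the
    theorem, and the distance then follows from Lemma 4.
    Returns (distance, curve_length).
    """
    length = chord / math.cos(kappa / 2)
    return 0.5 * (length - chord), length

# Check the scalar inequality on a grid over [0, pi/6].
grid = [i * (math.pi / 6) / 1000 for i in range(1001)]
inequality_holds = all(lemma5_gap(a) >= 0 for a in grid)
```

Since the two-segment curve attains equality in the theorem, it is the hardest case for a given turn and chord, so it is a natural spot check for the constant.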

Lemma 6 If $c \in C_{K,L}$ then for each $\epsilon > 0$ the curve $c$ can be approximated to within
$\epsilon$ by an inscribed piecewise linear curve with at most $K^2 L/(8\epsilon)$ segments.

Proof: As usual, let $s$ denote arc length along $c$. Since $\kappa(c) \le K$, for any $\alpha > 0$ we
can find a decomposition of $c$ into at most $\lceil K/\alpha \rceil$ pieces
$\ell_1, \ldots, \ell_{\lceil K/\alpha \rceil}$ such that $\kappa(\ell_i) \le \alpha$ for each $i$. For
example, let $s_0 = 0$ and let

$$s_i = \sup\{s_{i-1} \le s \le L \mid \kappa(c(s_{i-1}, s)) \le \alpha\}$$

where $c(s_{i-1}, s)$ is the part of the curve $c$ between arc length $s_{i-1}$ and $s$ inclusive. Then,
let $\ell_i = c(s_{i-1}, s_i)$. By definition, $\kappa(\ell_i) \le \alpha$. The turn of a curve satisfies
$\kappa(c(s, s')) \ge \kappa(c(s, t)) + \kappa(c(t, s'))$ for any $s < t < s'$, and
$\kappa(c(s, s')) \to 0$ as $s' \to s$ from the right ([1], Corollaries 2 and 3, p. 121).
From these properties it follows that if $s_i < L$ then for any $\eta > 0$,
$\kappa(c(0, s_i + \eta)) > i\alpha$. Since $\kappa(c) \le K$ we must have $s_i = L$ for some
$i \le \lceil K/\alpha \rceil$.

Now, let $\bar{\ell}_i$ be the line segment joining the ends of $\ell_i$. Clearly, the union of the
$\bar{\ell}_i$ forms a piecewise linear curve inscribed in $c$ (i.e., with endpoints of the segments
lying on $c$). From Lemmas 3 and 5, and the fact that $\mathcal{L}(\ell_i) \le \mathcal{L}(c) \le L$,
we have

$$d_P\Big(c, \bigcup_{i=1}^{K/\alpha} \bar{\ell}_i\Big) \le \sum_{i=1}^{K/\alpha} d_P(\ell_i, \bar{\ell}_i) \le \frac{K}{\alpha} \cdot \frac{L}{8}\alpha^2 = \frac{KL}{8}\alpha$$

Thus, for $\alpha \le 8\epsilon/(KL)$, $d_P(c, \bigcup_i \bar{\ell}_i) \le \epsilon$, so that
$K/\alpha = K^2 L/(8\epsilon)$ segments suffice for an $\epsilon$-approximation to $c$ by an inscribed
piecewise linear curve.

□


Theorem 6 Let $C_{K,L}$ be the set of all curves in the unit square with turn bounded by $K$ and length
bounded by $L$. Let $P$ be the uniform distribution on the set of lines intersecting the unit square,
and let $d_P$ be the metric on $C_{K,L}$ defined by $d_P(c_1, c_2) = E|n(x, c_1) - n(x, c_2)|$. Then the
metric entropy of $C_{K,L}$ with respect to $d_P$ satisfies

$$N(\epsilon, C_{K,L}, P) \le \left(\frac{K^4 L^2}{8\epsilon^4}\right)^{\frac{K^2 L}{4\epsilon} + 1}$$

Proof: We construct an $\epsilon$-cover for $C_{K,L}$ as follows. Consider a rectangular grid of points
in the unit square with spacing $2\sqrt{2}\,\epsilon^2/(K^2 L)$, and consider the set of piecewise linear
curves with at most $K^2 L/(4\epsilon)$ segments the endpoints of which all lie on this grid. There are
$K^4 L^2/(8\epsilon^4)$ points in the grid, so that there are at most
$(K^4 L^2/(8\epsilon^4))^{K^2 L/(4\epsilon) + 1}$ distinct curves in this set.

To show that this set is an $\epsilon$-cover for $C_{K,L}$, let $c \in C_{K,L}$. By Lemma 6 there is a
piecewise linear curve $\bar{c}$ with at most $K^2 L/(4\epsilon)$ segments such that
$d_P(c, \bar{c}) \le \epsilon/2$. We can find a curve $c'$ in the set close to $\bar{c}$ by finding a
point on the grid within $2\epsilon^2/(K^2 L)$ of each endpoint of a segment in $\bar{c}$. By Lemma 1
each line segment of $c'$ is a distance at most $2\epsilon^2/(K^2 L)$ (with respect to $d_P$) from the
corresponding line segment of $\bar{c}$. Since $\bar{c}, c'$ consist of at most $K^2 L/(4\epsilon)$
segments, applying Lemma 3 we get $d_P(\bar{c}, c') \le \epsilon/2$.
Hence, by the triangle inequality $d_P(c, c') \le \epsilon$.


□ 


We can now prove a learning result for curves of bounded turn and length:


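Theorem 7 Let $C_{K,L}$ be the set of all curves in the unit square with turn bounded by $K$ and length
bounded by $L$. Then $C_{K,L}$ is learnable by counting intersections with straight lines drawn
uniformly using

$$m(\epsilon, \delta) = \frac{K^4 L^2}{2\epsilon^4} \ln\left(\frac{2}{\delta}\left(\frac{2K^4 L^2}{\epsilon^4}\right)^{\frac{K^2 L}{2\epsilon} + 1}\right)$$

samples.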


Proof: Consider the concept class over $X$ corresponding to $C_{K,L}$, in which each curve
$c \in C_{K,L}$ is identified with the function $c(x) = n(x, c)$, i.e., $c(x)$ is the number of
intersections of the line $x$ with $c$.

Using the construction of Theorem 6, we have an $\epsilon/2$-cover of $C_{K,L}$ with
$(2K^4 L^2/\epsilon^4)^{K^2 L/(2\epsilon) + 1}$ elements. Furthermore, each element of the
$\epsilon/2$-cover consists of at most $K^2 L/(2\epsilon)$ line segments. Since a line $x$ can
intersect each segment at most once, we have $0 \le c_i(x) \le K^2 L/(2\epsilon)$ for every $c_i$ in
the cover. Hence, the result follows from Theorem 1.


□ 
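To get a feel for the magnitudes involved, the sketch below evaluates the segment count of Lemma 6 and the sample size of Theorem 7. It assumes the expressions read $K^2L/(8\epsilon)$ and $m(\epsilon,\delta) = \frac{K^4L^2}{2\epsilon^4}\ln\big(\frac{2}{\delta}(2K^4L^2/\epsilon^4)^{K^2L/(2\epsilon)+1}\big)$; the function names are hypothetical, and rounding up to an integer is a convenience of this example.

```python
import math

def segments_needed(K, L, eps):
    """Segments K^2 L / (8 eps) for an eps-approximation (Lemma 6), rounded up."""
    return math.ceil(K ** 2 * L / (8 * eps))

def sample_size(K, L, eps, delta):
    """Evaluate the assumed Theorem 7 expression for m(eps, delta), rounded up."""
    lead = K ** 4 * L ** 2 / (2 * eps ** 4)
    exponent = K ** 2 * L / (2 * eps) + 1
    # Take the logarithm of the cover size term by term to avoid overflow.
    log_cover = exponent * math.log(2 * K ** 4 * L ** 2 / eps ** 4)
    return math.ceil(lead * (math.log(2 / delta) + log_cover))
```

The bound grows roughly like $\epsilon^{-5}$ up to logarithmic factors, so halving $\epsilon$ multiplies the number of lines required by more than thirty.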

It is interesting to note that $C_{K,L}$ has infinite pseudo dimension (generalized VC dimension),
so that one would not expect $C_{K,L}$ to be distribution-free learnable. That the pseudo dimension
is infinite can be seen as follows. First, assume that $K, L \ge 2\pi$. For each $k$, let
$x_1, \ldots, x_k$ be the set of lines corresponding to the sides of a $k$-gon inscribed in the unit
circle. For any subset $G$ of these $k$ lines, we can find a curve $c_G \in C_{K,L}$ so that
$n(x_i, c_G) = 2$ for $x_i \in G$ and $n(x_i, c_G) = 0$ for $x_i \notin G$. Such a curve can be
obtained by taking a point on the unit circle in each arc corresponding to $x_i \in G$, and taking
$c_G$ to be the boundary of the convex hull of these points. Then, $\kappa(c_G) = 2\pi$ and
$\mathcal{L}(c_G) \le 2\pi$ so that $c_G \in C_{K,L}$. Thus, the set $x_1, \ldots, x_k$ is shattered
by $C_{K,L}$, and since $k$ is arbitrary the pseudo dimension of $C_{K,L}$ is infinite. For
$K, L < 2\pi$ we can apply essentially the same construction over an arc of the unit circle and
without taking $c_G$ to be a closed curve.
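The shattering construction can be checked mechanically for small $k$. The sketch below assumes a regular $k$-gon, picks the midpoint of each selected arc, and counts crossings of the resulting closed polygon with each side line. It covers only subsets with at least two lines, since for smaller subsets the convex hull degenerates to a point or is empty; the function name is hypothetical.

```python
import math
from itertools import combinations

def shatter_check(k):
    """True if every subset G with |G| >= 2 of the k side lines of a regular
    k-gon is realised with intersection counts 2 on G and 0 off G."""
    # Side line i has unit normal at angle (2i + 1) * pi / k and lies at
    # distance cos(pi / k) from the origin.
    offset = math.cos(math.pi / k)
    normals = [(math.cos((2 * i + 1) * math.pi / k),
                math.sin((2 * i + 1) * math.pi / k)) for i in range(k)]
    for size in range(2, k + 1):
        for G in combinations(range(k), size):
            # The midpoint of the arc cut off by line i coincides with normal i.
            # Taken in angular order, these points form the convex polygon c_G.
            pts = [normals[i] for i in G]
            closed = pts + pts[:1]
            for i, (nx, ny) in enumerate(normals):
                side = [nx * x + ny * y - offset for x, y in closed]
                count = sum(1 for a, b in zip(side, side[1:]) if a * b < 0)
                if count != (2 if i in G else 0):
                    return False
    return True
```

Each chosen point lies strictly outside its own side line and strictly inside every other one, which is why exactly the two polygon edges adjacent to that point cross the line.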

4.3  Connections  With  the  Stochastic  Geometry  Result 

For the class of curves whose length and curvature are bounded by constants, the learnability result
of Theorem 7 can be thought of as a refinement of the stochastic geometry result. First, using the
expression for the expected number of intersections, one can estimate or “learn” the length of $c$
from a set of generalized samples. The learnability result makes the much stronger statement that
the curve $c$ itself can be learned (from which the length can then be estimated). To show that the
length can be estimated, recall that Theorem 3 gives $E\,n(y, c) = \frac{1}{2}\mathcal{L}(c)$ (as used
in Lemma 4), so that

$$|\mathcal{L}(c_1) - \mathcal{L}(c_2)| = |2E(n(y, c_1) - n(y, c_2))| \le 2E|n(y, c_1) - n(y, c_2)| = 2\,d_P(c_1, c_2)$$

Hence if we learn $c$ to within $\epsilon$ then the length of $c$ can be obtained to within $2\epsilon$.

Second, for the class of curves considered, we have a uniform learning result. Hence, this refines
the stochastic geometry result by guaranteeing uniform convergence of empirical estimates of length
to true length for the class of curves considered.
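The empirical length estimate referred to here can be sketched as follows, assuming the unit square is $[0,1]^2$ and that the expected number of intersections equals half the curve length, as used in Lemma 4. The rejection sampler and all function names are choices made for this example rather than anything prescribed by the paper.

```python
import math
import random

def random_line(rng):
    """Line drawn uniformly (measure dp dtheta) among lines hitting [0,1]^2,
    returned as (ux, uy, p) for the line {(x, y): ux*x + uy*y = p}."""
    half_diag = math.sqrt(2) / 2
    while True:
        theta = rng.uniform(0.0, math.pi)
        r = rng.uniform(-half_diag, half_diag)  # signed distance from the centre
        ux, uy = math.cos(theta), math.sin(theta)
        if abs(r) <= (abs(ux) + abs(uy)) / 2:   # accept only lines meeting the square
            return ux, uy, r + 0.5 * (ux + uy)

def crossings(line, curve):
    """Number of segments of the polyline `curve` crossed by the line."""
    ux, uy, p = line
    side = [ux * x + uy * y - p for x, y in curve]
    return sum(1 for a, b in zip(side, side[1:]) if a * b < 0)

def estimate_length(curve, samples=100_000, seed=0):
    """Estimate the curve length as twice the mean intersection count."""
    rng = random.Random(seed)
    total = sum(crossings(random_line(rng), curve) for _ in range(samples))
    return 2.0 * total / samples
```

For a curve made of two segments of length 0.6 each, the estimate should be close to 1.2, with an error that shrinks like the inverse square root of the number of sampled lines.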


5  Discussion 

We introduced a model of learning from generalized samples, and considered an application of this
model to a problem of reconstructing a curve by counting intersections with random lines. The
curve reconstruction problem is closely related to a well known result from stochastic geometry.
The stochastic geometry result (Theorem 3) suggests that the length of a curve can be estimated
by counting the number of intersections with an appropriate set of lines, and this has been studied
by others. Our results show that for certain classes of curves the curve itself can be learned from
such information. Furthermore, over these classes of curves the estimates of length from a random
sample converge uniformly to the true length of a curve.

The learning result for curves is in terms of a metric induced by the uniform measure on the set
of lines. Although some properties of this metric are known, to better understand the implications
of the learning result, it would be useful to obtain further properties of this metric. One approach
might be to obtain relationships between this metric and other metrics on sets of curves (e.g., the
Hausdorff metric, $d_H$). For example, we conjecture that over the class $C_{K,L}$

$$\inf_{\{c_1, c_2 \,\mid\, d_H(c_1, c_2) \ge \epsilon\}} d_P(c_1, c_2) > 0$$

This result combined with the learning result with respect to $d_P$ would immediately imply a learning
result with respect to $d_H$.

The stochastic geometry result holds for any bounded convex subset of the plane, and as we
mentioned before, our results can be extended to this case as well. Furthermore, results analogous
to Theorem 3 can be shown in higher dimensions and in some non-Euclidean spaces [14]. Some
results on curves of bounded turn analogous to those we needed also can be obtained more generally
[1]. Hence, learning results should be obtainable for these cases.

Regarding  other  possible  extensions  of  the  problem  of  learning  a  curve,  note  that  the  stochastic 
geometry  result  is  not  true  for  distributions  other  than  the  uniform  distribution.  Also,  we  are 
not  aware  of  any  generalizations  to  cases  where  parameterized  curves  other  than  lines  are  drawn 
randomly.  However,  learnability  results  likely  hold  true  for  some  other  distributions  and  perhaps 
for  other  randomly  drawn  parameterized  curves,  although  the  metric  entropy  computations  may 
be  difficult. 

There is an interesting connection between the problem of learning a curve discussed here
and a problem of computing the length of curves from discrete approximations. In particular, it
can be shown that computing the length of a curve from its digitization on a rectangular grid
requires a nonlocal computation (even for just straight line segments), although computing the
length of a line segment from discrete approximations on a random tessellation can be done locally
[9]. The construction is essentially a learning problem with intersection samples from random
straight lines. Furthermore, the construction provides insight as to why local computation fails
for a rectangular digitization and suggests that appropriate deterministic digitizations would still
allow local computations. This is related to the work in [11].

We  considered  here  only  one  particular  example  of  learning  from  generalized  samples.  However, 
we  expect  that  this  framework  can  be  applied  to  a  number  of  problems  in  signal/image  processing, 
geometric  reconstruction,  stereology,  etc.,  to  provide  learnability  results  and  sample  size  bounds 
under  a  PAC  criterion.  As  previously  mentioned,  learning  with  generalized  samples  is  in  essence 
simply  a  transformation  to  a  different  standard  learning  problem,  although  the  variety  available  in 
choosing  this  transformation  (i.e.,  the  form  of  the  generalized  samples)  should  allow  the  learning 
framework  and  results  to  be  applied  to  a  broad  range  of  problems. 

For example, the generalized samples could consist of drawing certain random sets and returning
the integral of the concept over these sets. Other possibilities might be to return weighted
integrals of the concept where the weighting function is selected randomly from a suitable set (e.g.,
an orthonormal basis), or to sample derivatives of the concept at random points. One interesting
application would be to problems in tomographic reconstruction. In these problems, one is interested
in reconstructing a function from a set of projections of the function onto lower dimensional
subspaces. One could have the generalized samples consist of drawing random lines labeled according
to the integral of the unknown function along the line. This would correspond to a problem
in tomographic reconstruction with random ray sampling. Alternatively, as previously mentioned,
one could combine the general framework discussed by Haussler [6] with generalized samples, and
consider an application to tomography where the generalized samples consist of entire projections.
This would be more in line with standard problems in tomography, but with the directions of the
projections being chosen randomly.

For more geometric problems in which the concepts are subsets of $X$, some interesting generalized
samples might be to draw random (parameterized) subsets (e.g., disks, lines, or other
parameterized curves) of $X$ labeled as to whether or not the random set intersects or is contained
in the target concept. Other possibilities might be to label the random set as to the number of
intersections (or length, area, or volume of the intersection, as appropriate) with the unknown
concept. One interesting application to consider would be the reconstruction of a convex set from
various types of data (e.g., see [7, 16, 10, 13]). For example, the generalized samples could be random
lines labeled as to whether or not they intersect the convex set (which would provide bounds on
the support function). This is actually just a special case of learning a curve which is closed and
convex, although tighter bounds should be obtainable due to the added restrictions. Alternatively,
the lines could be labeled as to the length of the intersection (which is like the tomography problem
with random ray sampling in the case of binary objects). A third possibility (which is actually just
learning from standard samples) would be to obtain samples of the support function.

Formulating learning from generalized samples in the general framework of Haussler [6] allows
issues such as noisy samples to be treated in a unified framework. Application of the framework
to a particular problem reduces the question of estimation/learning under a PAC criterion to a
metric entropy (or generalized VC dimension) computation. This is not meant to imply that
such a computation is easy. On the contrary, the metric entropy computation is the essence of
the problem and can be quite difficult. Another problem which can be difficult is interpreting the
learning criterion on the original space induced by the distribution on the generalized samples. The
induced metric is a natural one given the type of information available, but it may be difficult to
understand the properties it endows on the original concept class. Finally, although this approach
may provide sample size bounds for a variety of problems, it leaves wide open the question of
finding good algorithms.